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Abstract: Estimation in the deformable template model is a big challenge in image analy- 
sis. The issue is to estimate an atlas of a population. This atlas contains a template and the 
corresponding geometrical variability of the observed shapes. The goal is to propose an accu- 
rate algorithm with low computational cost and with theoretical guaranties of relevance. This 
becomes very demanding when dealing with high dimensional data which is particularly the 

S case of medical images. We propose to use an optimized Monte Carlo Markov Chain (MCMC) 
method into a stochastic Expectation Maximization (EM) algorithm in order to estimate the 
I I model parameters by maximizing the likelihood. We present a new Anisotropic Metropolis 

Adjusted Langevin Algorithm (AMALA) which is used as transition in the MCMC method. 
We first prove that this new sampler leads to a geometrically uniformly ergodic Markov 
^ chain. We prove also that under mild conditions, the estimated parameters converge almost 

OO surely and are asymptotically Gaussian distributed. The methodology developed is then 

ff) tested on handwritten digits and some 2D and 3D medical images for the deformable model 

estimation. More widely, the proposed algorithm can be used for a large range of models in 
many field of applications such as pharmacology or genetic. 
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1. Introduction 



We consider here the deformable template model introduced for Computational Anatomy by 
Grenander and Miller (1998). This model, which has demonstrated great impact in image analysis, 
was developed and analyzed later on by many groups (among other Marsland and Twining (2004); 
Miller ct al (2009, 2002); Vercauteren et al (2009)). It offers several major advantages. First, it 
enables to describe the population of interest by a digital anatomical template. Then, it captures 
the geometric variability of the population shapes through the modeling of deformations of the 
template which match it to the observations. Moreover, the metric on the space of deformations 
is specified in the model as a quantification of the deformation cost. Not only describing the 
population, this generative model also allows to generate synthetic data using both the template 
and the geometrical metric of the deformation space which together define the atlas. Nevertheless, 
the key statistical issue is how to estimate efficiently and accurately these parameters of the model 
from an observed population of images. 
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Several numerical methods have been developed mainly for the estimation of the template im- 
age (for example Cootes et al (1995); Joshi et al (2004)). Even if these methods lead to interesting 
results on some training samples, they suffer from a lack of theoretical properties and are not 
robust to noisy data. Another important contribution toward the statistical formulation of the 
issue of template estimation was proposed by Glasbey and Mardia (2001). However interesting 
this approach is not entirely satisfactory since the deformations are applied to discrete obser- 
vations requiring some interpolation. Moreover it does not formulate the analysis in terms of a 
generative model which appears very attractive as mentioned above. To overcome these lacks, 
Allassonniere et al (2007) have formulated a coherent statistical generative model. For estimating 
all the model parameters, the template image together with the geometrical metric, the authors 
proposed a deterministic algorithm based on an approximation of the well-known Expectation 
Maximization (EM) algorithm (Dempster et al (1977)), where the posterior distribution is re- 
placed by a Dirac measure on its mode (called FAM-EM). However, such an approximation leads 
to the non-convergence of the estimates highlighted when considering noisy observations. 

One solution to face this problem is to consider a convergent stochastic approximation of the 
EM (SAEM) algorithm which was proposed by Dclyon et al (1999). An extension using Monte 
Carlo Markov Chain (MCMC) methods was developed and studied by Kuhn and Lavielle (2004) 
and Allassonniere et al (2010b) allowing for wider applications. To apply this extension to the 
deformable template model, Allassonniere et al (2010b) chose a Metropolis Hastings within Gibbs 
sampler (also called hybrid Gibbs) as MCMC method since the variables to sample were of large 
dimension (the usual Metropolis Hastings algorithm providing low acceptation rates). This esti- 
mation algorithm has been proved convergent and performs very well on very different kind of data 
as presented by Allassonniere et al (2010a). Nevertheless, the hybrid Gibbs sampler becomes com- 
putationally very expensive when sampling very high dimensional variables. Although it reduces 
the dimension of the sampling to one which enables to stride easier the target density support, it 
loops over the sampling variable coordinates, which becomes computationally unusable as soon as 
the dimension is very large or as the acceptation ratio involves heavy computations. To overcome 
the problem of computational cost of this estimation algorithm, some authors propose to simplify 
the statistical model constraining the correlations of the deformations (c.f. Maire et al (2011); 
Richard et al (2009)). Our purpose in this paper is to propose an efficient and convergent esti- 
mation algorithm for the deformable template model in high dimension without any constrains. 
With regards to the above considerations, the computational cost of the estimation algorithm can 
be reduced by optimizing the sampling scheme in the MCMC method. 

The sampling of high dimensional variables is a well-known difficult challenge. In particular, 
many authors have proposed to use the Metropolis Adjusted Langevin Algorithm (MALA) (see 
Roberts and Tweedie (1996) and Stramer and Tweedie (1999a)). This algorithm is a particular 
random walk Metropolis-Hastings sampler. Starting from the current iterate of the Markov chain, 
one simulates a candidate with respect to a Gaussian proposal with an expectation equal to the 
sum of this current iterate and a drift related to the target distribution. The covariance matrix 
is diagonal and isotropic. This candidate is accepted or rejected with a probability given by the 
Metropolis Hastings acceptance ratio. 

Some modifications have been proposed in particular to optimize the covariance matrix of the 
proposal in order to better stride the support of the target distribution (see Atchadc (2006); 
Girolami and Calderhead (2011); Marshall and Roberts (2012); Stramer and Tweedie (1999b)). 
Atchade (2006) and Marshall and Roberts (2012) proposed to construct adaptive MALA chains 
for which they prove the geometric ergodicity of the chain uniformly on any compact subset of 
its parameters. Unfortunately, this technique does not take the whole advantage of changing the 
proposal using the target distribution. In particular, the covariance matrix of the proposal is given 
by a stochastic approximation of the empirical covariance matrix. This choice seems completely 
relevant as soon as the convergence toward the stationary distribution is reached. However, it 
does not provide a good guess of the variability during the first iterations of the chain since 
it is still very dependent on the initialization. This leads to chains that may be numerically 
trapped. Moreover, this particular algorithm may require a lot of tuning parameters. Although 
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the theoretical convergence is proved, this algorithm may be very difficult to optimize in practice 
into an estimation process. 

Recently, Girolami and Calderhead (2011) proposed the Riemann manifold Langevin algorithm 
in order to sample from target density in high dimensional setting with strong correlations. This 
algorithm is also a MALA based one for which the choice of the proposal covariance is guided by 
the metric of the underlying Riemann manifold. It requires to evaluate the metric, its inverse as 
well as its derivatives. The proposed well-suited metric is the Fisher-Rao information matrix or its 
empirical value. However, in the context we are dealing with, the real metric, namely the metric of 
the space of non-rigid deformations, is not explicit (a particular case is calculated by Micheli et al 
(2012)) preventing from any use of it. Moreover, if we consider the constant curvature simplifica- 
tion suggested by Girolami and Calderhead (2011), one still needs to invert the metric which may 
be neither explicit nor computationally tractable. Note that these constrains are common with 
other application fields such as genetic or pharmacology, where the models are often complex. 

For all these reasons, we propose to adapt the MALA algorithm in the spirit of both works of 
Atchadc (2006) and Girolami and Calderhead (2011) to get an efficient sampler into the stochastic 
EM algorithm. Therefore, we propose to sample from a proposal distribution which has the same 
expectation as the MALA but using a full anisotropic covariance matrix based on the anisotropy 
and correlations of the target distribution (called in the sequel AMALA) . The expectation is ob- 
tained as the sum of the current iterate plus a drift which is proportional to the gradient of the 
logarithm of the target distribution. We construct the covariance matrix as a regularization of the 
Gram matrix of this drift. We prove the geometric ergodicity uniformly on any compact set of the 
AMALA assuming some regularity conditions on the target distribution. We also prove the almost 
sure convergence of the parameter estimated sequence generated by the coupling of AMALA and 
SAEM algorithms (AMALA-SAEM) toward the maximum likelihood estimate under some regu- 
larity assumptions on the model. Moreover, we prove a Central Limit Theorem for this sequence 
under weak conditions on the model. We test our estimation algorithm on the deformable template 
model for estimating hand-written digit atlases from the USPS database and medical images of 
corpus callosum (2D) and of dendrite spine excrescences (3D). The proposed estimation method 
is compared with the results obtained from the FAM-EM algorithm and from the MCMC-SAEM 
algorithm using different samplers namely the hybrid Gibbs sampler, the MALA and the adaptive 
MALA proposed by Atchade (2006) previously introduced. The comparison is also made via clas- 
sification rates on the USPS database. These experiments demonstrate the good behavior of our 
method in both the accuracy of the estimation and the low computational cost in high dimension. 

The paper is organized as follows. In Section 2, we recall the Bayesian Mixed Effect (BME) 
template model. In Section 3, we consider the estimation issue in general framework of missing 
data models. We present our stochastic version of the EM algorithm using the AMALA sampler. 
The convergence properties are established in Section 4. Section 5 is devoted to the experiments 
on the BME template estimation. Finally, we give some conclusion in Section 6. The proofs are 
postponed in Section 7. 

2. Description of the Bayesian Mixed Effect (BME) Template model 

The deformable template model aims at summarizing a population of images by two quantities. 
The first one is a mean image which has to represent a relevant shape as one could find in the 
population. The second quantity represents the variance in the space of shapes. This corresponds to 
the geometrical variability around the mean shape. In practice, numerical approximations of both 
quantities are obtained through statistical estimation process. Let us now describe the deformable 
template model. 

We consider the hierarchical Bayesian framework for dense deformable template developed by 
Allassonniere et al (2007) where each image in a population is assumed to be generated as a noisy 
and randomly deformed version of the mean image also called template. 
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The database is composed of n grey level images {yi)\<i< n observed on a grid A of pixels (or 
voxels) {v u £ D, u £ A} included in a continuous domain D C M. d , (typically D = [— 1, l] d where d 
equals 2 or 3). The expected template Io : M. d — > K takes its values in the continuous domain. Each 
observation y is assumed to be a discretization on A of a random deformation of this template 
plus an independent noise. Therefore, there exists an unobserved deformation field (also called 
mapping) m : R d —> R d such that for u £ A 

y( u ) = Io( v u - m(v u )) + cre(u) , 

where ae denotes the independent additive noise and v u is the location of pixel (or voxel) u. 

Considering the template and the deformations as continuous functions would lead to a dense 
problem. The dimension is reduced assuming that both elements belong to a subset of fixed 
Reproducing Kernel Hilbert Spaces (RKHS) V p and V g defined by their respective kernels K p and 
Kg. More precisely, let { r P ,j)i<j<k p -respectively (r g .j)i<j<k s - be some fixed control points in the 
domain D: there exist a £ M. kp -resp. z £ R ks x R kg - such that for all v in D: 

I a (v) = (K p a)(v) = ^2 K p (v, r Pij )a j and 

m z (v) = (K g z)(v) • 

For clarity, write y = (yi)i<i< n for the n— tuple of observations and z = (zi)i<i< n for the 
n— tuple of unobserved variables defining the deformations. The statistical model on the observa- 
tions is chosen as follows: 

s~®2 =1 ^«» e (o,r 1 ,) | r B , 

(2.1) 

V ~ ®?=iN\A\(m z J a ,a 2 Id) | z,a,cr 2 , 

where denotes the product of independent variables and ml a (u) = I a (v u — m(v u )), for u in A. 
The parameters of interest are a (the template) , a 2 (the noise variance) and T g (the deformation 
covariance matrix). We assume that 9 = (a,a 2 ,T g ) belongs to the parameter space O: 

6 = { 9 = (a,a 2 ,r g ) | ael\ |a| < R, a > 0, T 5 e Sym+ ^(M) }, (2.2) 

where Sym^. t (K) is the cone of real positive dk g x dk g definite symmetric matrices, R an arbitrary 
positive constant and d is the space dimension (typically 2 or 3 for images). 

Since we aim at dealing with small size samples and high dimensional parameters, we work in 
the Bayesian framework and we introduce priors on the parameters. In addition of guiding the 
estimation it regularizes the estimation as shown by Allassonniere et al (2007). The priors are all 
independent: 9 = (a, a 2 , T g ) ~ v p ® v g where 

v p (da, da 2 ) oc exp (-^( a - Mp) T ( s p) _1 ( a ~ /%>) J x f exp da 2 da, a p > 3 , 




v g {dT g ) cc exp(-(r- i ,S ff ) F /2)^= dT g ,a g > Ak g + 1 . 



For two matrices A, B we define the Frobenius inner product by (A, B)p = tr{A T B). 

Parameter estimation for this model is then performed by Maximum A Posteriori (MAP) : 



9 = argmaxg(#|y) , 
6 
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where q(9\y) is the posterior density of 9 conditional on y. The existence and consistence of the 
MAP estimator for the BME template model has been proved by Allassonniere et al (2007). 

Note that this model belongs to a more general class called the mixed effect models. The 
fixed effects are the parameters 9 and the random effects are the deformation coefficients z. The 
estimation issue in this class is treated in the same way as the likelihood maximization problem 
in the more general framework of incomplete-data models. Therefore, the next section will be 
presented in this general setting in which the proposed algorithm applies. 

3. Maximum likelihood estimation 

3.1. Maximum likelihood estimation for incomplete data setting 

We consider in this section the standard incomplete-data (or partially-observed-data) setting and 
recall the usual notation. We denote by y £ M. q the observed data and by z £ M. 1 the missing 
data, so that we obtain the complete data (y,z) £ R q+l for some geN* and leN*. We consider 
these data as random vectors. Let a' be a cr-finite measure on M. q+l and \i the restriction of p! 
to W generated by the projection (y, z) z. We assume that the probability density function of 
the random vector (y, z) belongs to V = {f(y,z;9),9 £ 0}, a family of parametric probability 
density functions on ]R 9+ ' w.r.t. u' , where 9 C W. Therefore, the observed likelihood (i.e. the 
incomplete-data likelihood) is defined for some 9 £ by: 



Our purpose is to find the maximum likelihood estimate that is the value 9 g in that maxi- 
mizes the observed likelihood g given a sample of observations. However, this maximization can 
often not be done analytically because of the integration involved in (3.f ). A powerful tool which 
enables to compute this maximization in such a setting is the Expectation Maximization (EM) al- 
gorithm (see Dempster et al (1977)). It is an iterative procedure which consists of two steps. First, 
the so-called E-step computes the conditional expectation of the complete log-likelihood using the 
current parameter value. Second, the M-step achieves the update of the parameter by maximizing 
this expectation over 0. However, the computation of this expectation is often intractable analyt- 
ically. Therefore, alternative procedures have been proposed. We are particularly interested in the 
Stochastic Approximation EM (SAEM) algorithm (see Delyon et al (1999)) because of its theoret- 
ical convergence property and its small computation time. In this stochastic algorithm, the usual 
E-step is replaced by two steps, the first one corresponding to the simulation of realizations of the 
missing data, the second one to the computation of a stochastic approximation of the complete 
log-likelihood using these simulated values. It can be shown that under weak regularity conditions 
the sequence generated by this algorithm converges almost surely toward a local maximum of the 
observed likelihood (see Delyon et al (1999)). 

Nevertheless the simulation step requires some attention. In the SAEM algorithm the simulated 
values of the missing data have to be drawn from the posterior distribution defined by: 



When not possible, the extension using MCMC method (Allassonniere et al (2010b); Kuhn 
and Lavielle (2004)) allows to apply the SAEM algorithm using simulations obtained from some 
transition probability of an ergodic Markov Chain having the targeted posterior distribution as 
stationary distribution. Methods like Metropolis Hastings algorithm or Gibbs sampler are useful 
to perform this assignment. However, this becomes very challenging in high dimensional setting. 
Indeed, when the MCMC procedure has to explore a space of high dimension, its convergence may 
occur in practice only after a possibly infinite time. Thus, it is necessary to optimize this MCMC 
procedure. This is what we will propose in the following paragraph. 




(3.1) 
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3.2. Description of the sampling method: Anisotropic Metropolis Adjusted Langevin 
Algorithm 

We propose an anisotropic version of the well-known Metropolis Adjusted Langevin Algorithm 
(MALA). So let us first recall the steps of the Metropolis Adjusted Langevin Algorithm (MALA). 
Let X be an open subset of M. , the /—dimensional Euclidean space equipped with its Borel 
er— algebra B. Let us denote tt the probability density function (pdf ) with respect to the Lebesgue 
measure on X of the target distribution. We assume that 7r is positive continuously differentiable. 
At each iteration k of this algorithm, a candidate X c is simulated with respect to the Gaussian 
distribution with expectation X k + ^-D(X k ) and covariance a 2 Idi where X k is the current value, 

D(x)= TT-Jk 1 — \7\ ^ log 7r(x) , (3.2) 

v ; max(6, |Vlog7r(x)|) w x ' 

Idi is the identity matrix in M. 1 and b > a fixed truncated threshold. In the following, we denote 
1mala{x, •) the pdf of this Gaussian candidate distribution starting from x. Given this candidate, 
the next value of the Markov chain is updated using an acceptance ratio ctMALA{Xk, X c ) as follows: 

X k+1 = X c with probability a M ALA(X k ,X c ) = min (l, ^^^S)) and Xk+1 = Xk with 
probability 1 — aMALA{X k ,X c ). This provides a transition kernel Umala of this form: for any 
Borel set A e B 

Umala(x,A) = / a M ALA{x,z)q M ALA{x,z)dz + l A (x) / (I - chmala(x, z))Qmala(x, z)dz . 

J A J X 

(3.3) 

Note that the truncation of the drift was already suggested by Gilks et al (1996) to provide 
more stability. 

The Gaussian proposal of the MALA algorithm is optimized with respect to its expectation 
guided by the Langevin diffusion. One step further is to optimize also its covariance matrix. A 
first work in this direction was proposed by Atchadc (2006). The covariance matrix of the pro- 
posal is given by a projection of a stochastic approximation of the empirical covariance matrix. 
It produces an adaptive Markov chain. This process involves some additional tuning parameters 
which have to be calibrated. Since our goal is to use this sampler in an estimation algorithm, the 
sampler has at each iteration a different target distribution (depending on the current estimate of 
the parameter). Therefore, the optimal tuning parameter may be different along the iterations of 
the estimation process. Although we agree with the idea of using adaptive chain, we prefer taking 
the advantage of the dynamic of the estimation algorithm. On the other side, an intrinsic solution 
has been proposed by Girolami and Calderhead (2011) where the covariance matrix is given by 
the metric of the Riemann manifold of the variable to sample. Unfortunately, this metric may not 
be accessible and its empirical approximation not easy to compute. This is particularly the case 
in the BME template model. 

For these reasons, we propose a sampler in the spirit of Atchadc (2006), Marshall and Roberts 
(2012) or Girolami and Calderhead (2011) however not providing an adaptive chain as motivated 
above. The adaption comes from the dependency of the target distribution with respect to the 
parameters of the model which are updated along the estimation algorithm. The proposal is still a 
Gaussian distribution but both the drift and the covariance matrix depend on the gradient of the 
target distribution. At the k th iteration, we are provided with Xk- The candidate is sampled from 
the Gaussian distribution with expectation Xk + 5D(Xk) and covariance matrix <5£(Afc) denoted 
in the sequel 

Af(X k + 5D(X k ),8Ys(X k )) where S(x) is given as : 



E(x) = eldi + D(x)D(x) T , 



(3.4) 
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with D defined in Eq (3.2) and e > is a small regularization parameter. Note that the threshold 
parameter b leads to a symmetric positive definite covariance matrix with bounded non zero eigen 
values. We introduced the gradient of log n into the covariance matrix to provide an anisotropic 
covariance matrix depending on the amplitude of the drift at the current value. When the drift is 
large, the candidate is likely to be far from the current value. This large step may not be of the 
right amplitude and a large variance will enable more flexibility. On the other hand, when the drift 
is small in a particular direction, it means that the current value is close to an admissible state for 
the Markov chain. Therefore, the candidate should neither be allowed to move too far nor with a 
large variance. This is taken into account here by introducing the drift into the covariance matrix. 
We denote by q c the pdf of this proposal distribution. The transition kernel becomes: 

H(x,A)= / a(x, z)q c (x, z)dz + 1a(x) / (1 — a(x, z))q c (x, z)dz , (3-5) 

J A JX 

where a{X k , X c ) = min (l, ^^(S) ) ■ 

3.3. Description of the stochastic estimation algorithm 

Back to the stochastic estimation algorithm, the target distribution of the sampler is the posterior 
distribution p(-\y, 8). 

The four steps of the proposed AMALA-SAEM algorithm are detailed in this subsection : 
simulation, stochastic approximation, truncation on random boundaries and maximization steps. 
At each iteration k of the algorithm, simulated values of the missing data are drawn from the 
transition probability of the AMALA algorithm described in Section 3.2 with the current value of 
the parameters. Then, a stochastic approximation of the complete log-likelihood is computed using 
these simulated values for the missing data and is truncated using random boundaries. Finally, 
the parameters are updated by maximizing this quantity over 0. 

We consider here only parametric models V which belong to the set of curved exponential 
family, this means that the complete likelihood f(y, z; 8) can be written as: 

f(y,z;6) = exp[-^(8) + (S(z),<t>(0))} , 

where (•, •) denotes the Euclidean scalar product, the sufficient statistics S is a function on R , 
taking its values in a subset S of R m and tp, <j) are two functions on 8 (note that S, </> and tp 
may depend also on y, but we omit this dependency for simplicity). This condition is usual in the 
framework of EM algorithm applications and it is fulfilled by large range of models even complex 
ones as the BME template model. Therefore the stochastic approximation can be done either on 
the sufficient statistics S of the model or on the complete log-likelihood. 

Concerning the truncation procedure, we introduce a sequence of increasing compact subsets 
of S denoted by (IC q ) q >Q such that U q >oK. q = S and JC q C int(JC q +i), for all q > 0. Let also 
e = {e q ) q >o be a monotone non-increasing sequence of positive numbers and K a compact subset 
of M 1 . At iteration k we simulate a value z for the missing data from the Anisotropic Metropolis 
Adjusted Langevin Algorithm using the current value of the parameter 8k-i- We compute the 
associated stochastic approximation of the sufficient statistics of the model s. If it does not wander 
outside the current compact set JC k and if it is not too far from its previous value Sk-i, we keep 
the possible proposed values for (zf-,Sk). As soon as one of these conditions is not fulfilled, we 
reinitialize the sequences of z and s using a projection (for more details see Andrieu et al (2005) 
) and we increase the size of the compact set used for the truncation. 

Concerning the maximization step, we denote by L the function defined on S x taking values 
in M equals for all (s,8) to L(s,6) — —ij)(9) + (s, (f)(6)). We assume that there exists a function 8 
defined on S taking values in such that 

V0 € Vs € S L(s, 6(s)) > L(s, 8). 
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Finally we update the parameter using the value of the function 9 evaluated in s k - 

The complete algorithm is summarized in Algorithm 1. It only involves three parameters: b 
the threshold for the gradient which appears in the mean as well as in the covariance matrix, 
8 the scale on this gradient and e a small regularization parameter to ensure a positive definite 
covariance matrix. The scale d can be easily optimized looking at the data we are dealing with 
to adapt to the range of the drift. The value of the threshold b is in practice never reached. The 
practical choices for the sequence of positive step sizes used in the stochastic approximation step 
and the tuning parameters will be detailed in the section devoted to the experiments. 

Algorithm 1 AMALA within SAEM 

for all k = 1 : k en< i do 

Sample z c with respect to J\[(z k —\ + SD(zk-i , 9k— i), <52(zfc_i , 0k — l)) whose pdf is denoted q c (z k —i, .; 9 k —i) 
where 

£>(*fc_lA-l) = ^(fc.lviogptL^ly;^)!) X VlogpOfc-lMfe-l) 

S(z fc _i, 6> fe _i) = el^ + D(zk- 1 ,9 k - 1 )D{z k - 1 ,ek-i) T . 

Compute the acceptance ratio 

/ a \ ■ fi p(zc\y,8k-i)q c (zc,z k -i;8k-i) \ , . 

a(z fc _i,z c ,6» fc _ 1 )=mm 1,— — — (3.6) 

V qc(zk-i, z c ;8k-i)p{zk-i\y;9k-i) J 

Sample z = z c with probability a(z k _i, z c , k —i) and z = Z k —\ with probability 1 — a(z k —i, z c , 9k— i) 
Do the stochastic approximation 

s = s k -i + 7fe (S(z) - s k -i) , 

where (-fk)k is a sequence of positive step sizes, 
if s£ /C Kfc _ 1 and \\s — Sk-i | < ^Ck-i then 

Set (z k ,s k ) = (z,s) and n k = «fe-i, Vk = "k-i + 1> C* = Cfc-i + 1 
else 

set (zfc,Sfc) = (z,s) e K x K and n k = n k -i + l,u k = 0, ( k = ( k -i + *(^fc-i) 
where Vl 1 : N — > Z is a function such that \P(fc) > —k for any k 
and (z, s) is chosen arbitrarily, 
end if 

Update the parameter 

9 k =9(s k ). 

end for 



4. Theoretical Properties 

4-1- Geometric ergodicity of the AMALA 

The AMALA algorithm described in Subsection 3.2 creates a geometric ergodic Markov chain uni- 
formly on any compact subset. Atchade (2006) proved the geometric ergodicity for a bounded drift 
and a projection of the empirical covariance matrix. In our case, the same results hold using the 
previously detailed algorithm where the main difference stands in the anisotropic covariance of the 
proposal involving the gradient of log tt. An assumption is required on the stationary distribution 
namely the so-called super-exponential property given by: 

(Bl) The density 7r is positive with continuous first derivative such that: 

lim n(x).V log ir(x) = — oo (4-1) 



and 



limsupn(a;).m(a:) < 



(4.2) 
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where V is the gradient operator in M. 1 , n(x) = ^ is the unit vector pointing in the direction 

of x and m(x) — |y^)| is the unit vector in the direction of the gradient of the stationary 
distribution at point x. 

Proposition 1. Assume (Bl). Let < j3 < 1 and V(x) = ctt(x)~ i3 where c is such that Vie 
X, V(x) > 1. There exist a set CCA", a probability measure v such that z^(C) > and there exist 
constants A G]0, 1[, b e [0, oo[ and e €]0, 1] such that 

IL(x,A) > ev{A)t c {x) Vx € X MAeB, (4.3) 
ILV(x) < XV(x) + bt c (x) . (4.4) 

The first equation defines C as a small set for the transition kernel II. Note that both e and 
v can depend on C. The second inequality is a drift condition which states that the transition 
kernel tends to bring back elements into the small set. As a consequence of these well known drift 
conditions, the transition kernel II is ^-uniformly ergodic. That is to say: there exist < p < 1 
and < c < oo such that for all n € N* and / such that = sup < oo: 

l|ir/(.)-^llv<^ B ll/llv. (4.5) 

The proof of Proposition 1 is given in Appendix. 

Remark 1. The same property holds for any power p of the function V satisfying < pf3 < 1. 
Indeed, the proof follows the same way as it can be seen in Section Appendix 7. This is a property 
that will appear useful in the sequel to prove some properties of the estimation algorithm. 



4-2. Convergence property of the general stochastic approximation using AM ALA 

We first prove a general theorem on the convergence of the stochastic approximation using the 
Anisotropic Metropolis Adjusted Langevin Algorithm. Then, we apply this result to prove the 
convergence of the estimated sequence generated by the algorithm AMALA-SAEM. 

Let S be a subset of M. m for some positive integer m. Let X be a measurable subspace of M 1 for 
some positive integer I. For all s € S, let H s : X — > S be a measurable function. Let 7 = {^k)k 
be a sequence of positive step sizes. Let (~K s )ses be a family of positive continuously differentiable 
probability density functions with respect to the Lebesgue measure on X . For any s£5, denote 
by Ii s the transition kernel corresponding to the AM ALA procedure described in Section 3.2 with 
stationary distribution 7r s . 

Define now the stochastic approximation sequence {sk)k as follows : 

( s k = s k -i+~/ k -iH Sk _ 1 (z k ) with z k - n Sfc _ 1 (z fc _i, •) , if s fc _i e S , , 

\ s k = s c with z k = z c , ifs fc _!^5, 

where s c ^ S, z c ^ X. Denote by Q 7 the resulting transition which generates ((z k , s k )) k . We 
consider the natural filtration of the non-homogeneous chain ((z k , s k )) k and denote respectively 
by s and EJ S the probability measure and the corresponding expectation generated by this 
Markov chain starting at (zq, sq) G X x S and using the sequence 7. 

For any s £ S such that the function H s is integrable with respect to n s , we denote by h the 
mean field associated with our stochastic approximation so that : 

h(s) — J H s (z)n s (z)dz . 

Using the notation introduced in Subsection 3.3, let (f> : X x S —> K x /Co be a measurable 
function and W : N — > Z be a function such that ^(fc) > ~k for any k. We define the homogeneous 
Markov chain 

(Zk = (z k , s k ,K k ,(k,Vk))k (4.7) 
on Z = X x S x N 3 with the following transition at iteration k : 
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• If Uk—i = then draw 

(z k ,s k ) ~ Q 7Cfc i ($(zA;-i,Sfc_i),-); otherwise draw (z k ,s k ) ~ Q-y^^ ((zk-i, Sfe-i), Oi 

• If ||s fc - s fc _i|| < e^.j and s fc € /C Kfc _ 1 then set /c fe = /t fe _!, Oc = Cfe-l + 1 and = v k _ x + 1; 
otherwise set n k = >^k~i + 1, Cfc = Cfc-i + *(^fe-i) and i/ fc = 0. 



We consider the following assumptions, generalized from Andrieu ct al (2005) and Allassonniere 
ct al (2010b). 

(Al') S is an open subset of M. m , h : S —> M. m is continuous and there exists a continuously 
differentiable function w : S — > [0, oo[ with the following properties: 

(i) There exists an Mq > such that 

L = {seS,(Vw(s),h(s)) = Q}c{s€S, w(s) < M } . 

(ii) There exists a closed convex set S a C S for which s — > s + pH s (z) 6 S a for any p € [0, 1] 
and (z, s) € X x S a (S a is absorbing) and such that for any Mi e]M ,oo], the set 
Wmi n 5 a is a compact set of 5 where Wm 1 = {s £ 5, w(s) < Mi}. 

(iii) For any s e S\C (\7w{s) 7 h{s)) < 0. 

(iv) The closure of w(C) has an empty interior. 

(A2') For any s E S, H s : X S is measurable and J \\H s (z)\\tt s (z)<1z < oo. 



(A3") There exist a function V : X -> [l,oo] such that {z € <*,V(z) < oo} f 0, 
constants a g]0, 1], p > 2 and q > p such that for any compact subset K C S, 

(i) 

sup||iy s ||y< oo, (4.8) 
seic 

supdlfoHv + ||n.fl«||v) < oo, (4.9) 

seK 

sup \\s-s'\\- a {\\g s -g s ,\\ vq + (4.10) 

||n s g s -n s , 5s /|| v ,} < oo, (4.11) 

where for anyA s € 5 a solution of the Poisson equation g — II s g = — ir s (H s ) is 
denoted by <7 S . 

(ii) For any sequence e = (£fc)fc>o such that there exists e > satisfying e k < e for all 
k G N*, for any sequence 7 = (7fe)fc>o, there exists a constant C such that for any 

zex, 

supsupE7 s [VV{z k )t a{K)Av{e) > k ] < C K V%z) , (4.12) 

sG/C/c>0 

where v(e) = infjfc > 1, \\s k — Sfe-i|| > £&} and <r(K) = infjfc > l,Sfc ^ /C} and the 
expectation is related to the non-homogeneous Markov chain {{z k , s k )) k >o using the 
step-size sequence 7 = (7fc)fc>o- 

(A4) The sequences 7 = (7fc)fe>o and e = (e k ) k >o are non-increasing, positive and satisfy: 
00 00 

7fe = 00, lim e k = 0, {7fc + lk£ k + {lkS k ) p } < 00, where a and p are defined 
fc=o k=l 
in (A3"). 



We assume also some regularity properties of the stationary distribution with respect to s. 
(B2) For all z € X, the functions s H> tt s and s M- V z log^s are continuous on S. 
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Theorem 1 (Convergence Result for Stochastic Approximation). Assume (Al'),(A2'), (A3") 
and (A4)- Assume also that for alls £ S, the density function tt s satisfies assumptions (B1-B2). 
Let K C X be such that sup V(z) < oo and JCq C Wm H S a (where Mq is defined in (Al')), and 

let (Zk)k>o be the sequence defined in Equation (4.7). Then, for all zq £ K and sq £ JCq, we have 
lim d(sk,C) = Vz ,s ,o,o,o-a-s, where P 20)So ,o,o,o is the probability measure associated with the 

k— »oo 

chain (Z k = (z k , s k , K k , ( k , v k )) k > starting at (z , s , 0, 0, 0). 

Proof. The proof of this result is based on the theorem on general convergence result for stochastic 
approximation stated by Allassonniere et al (2010b) which generalized the one stated by Andrieu 
et al (2005). Assumptions (Al') and (A4) are the same. We prove that the choice of the Anisotropic 
Metropolis Adjusted Langevin Algorithm as the transition probability for the dynamic of the 
current Markov Chain combined with assumptions (A2') and (A3") implies that assumptions 
(A2) and (A3') of the theorem stated by Allassonniere et al (2010b) are fulfilled. The details are 
in Appendix 7. 

Remark 2. The differences between assumptions (A2'J, (A3") and (A2) and (A3') from Allas- 
sonniere et al (2010b) come from the existence of a unique stationary distribution and the existence 
of a solution to the Poisson equation which are actually induced by the theoretical properties of the 
AMALA sampling procedure. 

□ 



4-3. Convergence property of the estimated sequence generated by the 
AMALA-SAEM Algorithm 

We do the following assumptions on the model which are quite usual in the context of missing 
data model using EM-like algorithms (see Delyon et al (1999), Kuhn and Lavielle (2004)). 

For sake of simplicity we denote in the sequel pg(-) instead of p(-\y, 0) the posterior distribution. 

• (Ml) The parameter space is an open subset of W . The complete data likelihood function 
is given by: 

f(y,z;6)=exp{-Tp(6) + (S(z),<t>(6))}, 

where S is a Borel function on M. taking its values in an open subset S of K m . Moreover, 
the convex hull of S(M}) is included in S, and, for all in 6, 

\\S(z)\\p 8 (z)u(dz) < (X). 

• (M2) The functions ip and </> are twice continuously differentiable on O. 

• (M3) The function s : 9 -» S defined as 

s(9) = J S(z)pg(z)(i(dz) 

is continuously differentiable on O. 

• (M4) The function I : — > R defined as the observed-data log-likelihood 

1(6) ^ log g(y; 6)= log J f(y, z;0)»(dz) 
is continuously differentiable on and 

de / f{y,z;9)u(dz) = / d e f(y,z;6)u(dz). 
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• (M5) There exists a function 9 : S — > 0, such that: 

Vs es, ye e e, l{ s - §(s)) > l( S ; o). 

Moreover, the function is continuously differentiable on S. 

• (M6) The functions I : —> R and 8 : S — >• are m times differentiable. 

• (M7) 

(i) There exists an Mq > such that 

{s G S, dj(6(s)) = o} c {s G 5, -Z(0(s)) < Af } . 

(ii) For all Mi > M , the set ^^(S*^)) n I s € 5 : < M i} is a compact set of S. 

• (M8) There exists a polynomial function P of degree 2 such that for all z € A" 

||S(z)||<P(z). 

• (B3) For any compact subset /C of 5, there exists a polynomial function Q of the hidden 

variable such that sup |V Z \ogp&, Az)\ < Q(z). 
seic y ' 

Theorem 2 (Convergence Result for the Estimated Sequence generated by Algorithm 1). Assume 
(M1-M8) and (A4)- Assume that the family of posterior density functions {p§i s y s € S} satisfies 
assumptions (B1-B3). 

Let K C X be such that suppj^ (z) < oo and K.q C Wm H Conv(S(R 1 )) (where Mo is defined 
in (Ml)). Then, for all zq G K and sq G /Co, we have lim d(9k,C) = a.s. where (9k)k is the 
sequence generated by Algorithm 1 and C = {9 G Q : dgl{9) = 0}. 

The proof is postponed to Appendix 7.3. 



4-4- Central Limit Theorem for the estimated sequence generated by the 
AMALA-SAEM 



Theorem 2 ensures that the number of re- initializations of the sequence of stochastic approximation 
of Algorithm 1 is finite almost surely. We can therefore consider only the non truncated sequence 
when we are interested in its asymptotic behavior. 

Let us write the stochastic approximation procedure : 

Sfe = Sfc_i + 7 fe /i(s fc _i) + j k n k 



where i] k = S(z k ) - E pg(s _^ (S(z)), 
Let us introduce some assumptions. 

(Nl) The function h is C 1 in some neighborhood of s* which satisfies h(s*) = with first deriva- 
tives Lipschitz and J the Jacobean matrix of the mean field h in s* has all its eigenvalues 
with negative real part. 

(N2) Let £fc = g Sk _ 1 {z k ) — n Sfc l g Sf ._ 1 (z / ! C _i) where for any s G 5, g s is a solution of the Poisson 
equation g — Tl s g = H s —p s (H s ). Assume that for some matrix U and some random variable 
A>0, 

k 

lim E \xY,m-U)\ =0. 

L i=l 

(N3) The step size sequence (7^) is decreasing and satisfies either case 1 or case 2: 



Case 1 lim ( i L - 



= 0, 
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Case 2 for all k e N* 7fc = 1/fe. 

In Case 2, it requires also all eigenvalues of J have a real part lower than —1/2. 

Theorem 3. Under the assumptions of Theorem 2 and under (N1)-(N3), the sequence 

(sfe — s*)/^/7fe converges in distribution to a Gaussian random vector with zero mean and covariance 

matrix T and 

Vk,(6 k - 0*) Af(0,d s 9(s*)TdJ(s*Y) 

where 9* = 8(s*). 

The matrix T is characterized as the solution of some Lyapunov equation depending on the 
matrices U and J (see Theorem 4 in Section Appendix). The proof of Theorem 3 is given in 
Appendix 7.4. 

5. Applications on Bayesian Mixed Effect Template model 

We apply our estimation process on different databases. The first one is the USPS hand-written 
digit base as used by Allassonniere et al (2007) and Allassonniere ct al (2010b). The others two 
are medical images of 2D corpus callosum and 3D murine dendrite spine excrescences used by 
Allassonniere et al (2010a). 

We begin with presenting the experiments on the USPS database. In order to make comparison, 
we estimate the parameters in the same conditions as in the previous mentioned works that is to 
say the same 20 images per digit. Each image has grey level between (background) and 2 (bright 
white). These images are presented on the left panel of Fig. 1. We also use a noisy training dataset 
generated by adding a standardized independent Gaussian noise. These images are presented on 
the right panel of Fig. 1. We test five algorithms: the deterministic approximation of the EM 
algorithm (FAM-EM) presented by Allassonniere et al (2007), four MCMC-SAEM where the 
sampler is either the MALA, the adaptive MALA proposed by Atchade (2006), the hybrid Gibbs 
sampler presented by Allassonniere et al (2010b) and our AMALA algorithm. 
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Fig 1. Le/t: some images of the training set used for the estimation of the model parameters (inverse video). Right: 
same examples with additive noise. 

For these experiments the tuning parameters are chosen as follows: the threshold b is set to 

1.000, the scale 5 = 10~ 3 , the regularization e = 10~ 4 . The other tuning parameters and hyper- 
parameters are chosen as in Allassonniere et al (2010b). 

5.1. Computational performances 

We compare first the computational performances of the algorithms. The computational time is 
smaller for the three MCMC-SAEM algorithms using "MALA- like" samplers compared to the 
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FAM. Indeed, a numerical convergence of that algorithm requires about 30 to 50 EM steps. Each 
of them requires a gradient descent which has 15 iterations in average. This implies to compute 15 
times the gradient of the energy (which actually equals our gradient) for each image for each EM 
step. The " MAL A-like" -S AEM require about 100 to 150 EM steps (depending on the digit) but 
only one gradient is computed for each image at each step. This reduces the computational time 
by a factor of at least 4 (up to 7 depending on the digit). No comparison can be done when the 
data are noisy since the FAM-EM does not converges toward the MAP estimator as mentioned 
above. Comparing to the hybrid Gibbs-SAEM, the computational time is 8 times lower with the 
AMALA-SAEM in this particular case of application. Indeed, the hybrid Gibbs sampler requires 
no computation of the gradient. However, it includes a loop over the coordinates of the hidden 
variable, here the deformation vector of size 2k g — 72. At each of these iterations, the candidate is 
straightforward to sample whereas the computational cost lies into the acceptance rate. When this 
becomes heavy, the less times you calculate it, the better. In the AMALA-SAEM, this acceptance 
rate only has to be calculated once for each image. Therefore, even when the dimension of the 
hidden variable increases, this is of constant cost. The main price to pay is the computation of the 
gradient. Therefore, a tradeoff has to be found between the computation of either one gradient or 
dk g acceptance rates in order to select the algorithm to use in a given case. 

5.2. Results on the template estimation 

All the estimated templates obtained with the five algorithms and noise-free and noisy training 
data are presented in Fig. 2. As noticed by Allassonniere ct al (2010b), the FAM-EM estimation 
is sharp when the training set is noise-free and is deteriorated while adding noise. This behavior is 
not surprising with regard to the theoretical bound established by Bigot and Charlier (2011) in the 
particular case of compact deformation group. Considering the adaptive sampler, it does not reach 
a good estimation of the templates which are still very blurry and noisy in both cases. The problem 
seems to come from the very low acceptation rate already at the beginning of the estimation. The 
bad initial guess we have about the covariance matrix of the proposal seems to block the chain. 
Moreover, the tuning parameters are difficult to calibrate along the iterations of the estimation 
algorithm. Concerning the estimated templates using the Gibbs, MALA and AMALA samplers, 
they look very similar to each other using the noise-free data as well as the noisy ones. This 
similarity confirms the convergence of all these algorithms toward the MAP estimator. In this 
case, the templates are as expected, noise free and sharp. 

Nevertheless, when the dimension of the hidden variable increases, both the Gibbs and the 
MALA samplers show limitations. The Gibbs-SAEM would produce sharp estimations but ex- 
plodes the computational time. For this reason, we did not run this algorithm on higher dimension 
experiments. We run the estimation on the same noisy USPS database, increasing the number k g 
of geometrical control points. We choose the dimension of the deformation vector equal to 72, 128 
and 200. The results are presented in Fig. 3. Concerning the MALA sampler, it does not seem to 
capture the whole variability of the population in such high dimension. This yields a poorly esti- 
mation of the templates. This phenomenon does not appear using our AMALA-SAEM algorithm. 
The templates still look sharp and the acceptation rate remains reasonable. 

5.3. Results on the covariance matrix estimation 

We keep considering the USPS database. Since we are provided with a generative model, once 
the parameters have been estimated, we can generate synthetic samples in order to evaluate the 
constrained on the deformations that have been learnt. Some of these samples are presented in 
Fig. 4. For each digit, 20 examples are generated with the deformation given by +z and 20 others 
with — z where z is simulated with respect to Af(0,T g ). We recall that, as already noticed by 
Allassonniere et al (2010b), the Gaussian distribution is symmetric which may lead to strange 
samples in one direction whereas the other one looks like something present in the training set. 
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Fig 2. Estimated templates using the five algorithms and noise free and noisy data. The training set includes 20 
images per digit. The dimension of the hidden variable is 72. 
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Fig 3. Estimated templates using MALA and AMALA samplers in the stochastic EM algorithm on noisy training 
data. The training set includes 20 images per digit. The dimension of the hidden variable increases from 72 to 200. 
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With regards to the above remarks concerning the computational time and the template esti- 
mations, we present in this subsection only the results obtained using MALA and AMALA-SAEM 
algorithms. We notice that the samples generated by both algorithms look alike in the case of 
hidden variable of dimension 72. Thus, we present the results of our AMALA-SAEM estimation. 
As we can see, the deformations are very well estimated in both cases (without or with noise) and 
even look similar. This tends to demonstrate that the noise has really been separated from the 
template as well as the geometric variability during the estimation process. 
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Fig 4. Synthetic samples generated with respect to the BME template model using the estimated parameters with 
AMALA-SAEM. For each digit, the two lines represent the deformation using + and — the simulated deformation 
z. Left: data without noise. Right: data with noise variance 1. The number of geometric control points is 36 leading 
to a hidden variable of dimension 72. 



Increasing the dimension of the deformation to 128, we run both algorithms on the noisy 
dataset. We observe on Fig. 5 that the geometric variability of the samples remains similar to 
the one obtained in lower dimension using our AMALA-SAEM. However, the MALA-SAEM does 
not manage to capture the whole variability of the deformations which is related the results 
observed above on the template. This confirms the limitation of the use of MALA-SAEM in 
higher dimension. 



5-4- Results on the noise variance estimation 

The last check of the accuracy of the estimation relies in the noise variance estimation. The plots 
of their evolutions along the AMALA-SAEM iterations for each digit in both cases (without and 
with noise) are presented in Fig. 6. This variance is underestimated in particular in the noisy 
case, which is a well-known effect of the maximum likelihood (or in our case the MAP) estimator. 
We observe that the geometrically very constrained digits as 1 or 7 tend to converge very quickly 
whereas the digits 2 and 4 require more iterations to capture all the shape variability. 

Since this is a real parameter, we used it to illustrate the Central Limit Theorem stated in 
Subsection 4.4. Fig. 7 and Fig. 8 show the histograms of 10, 000 runs of the algorithm with the 
same initial conditions. We use the digits and 2 of the original data set as well as of the noisy 
data. As the iterations go along, the distribution of the estimates tends to look like a Gaussian 
distribution centered in the estimated noise variances which demonstrates empirically the Central 
Limit Theorem. 
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Fig 5. Synthetic samples generated with respect to the BME template model using the estimated parameters with 
AMALA-SAEM (left) and MALA-SAEM (right). For each digit, the two lines represent the deformation using + 
and — the simulated deformation z. The number of geometric control points is 64 leading to a hidden variable of 
dimension 128. 




Fig 6. Evolution of the estimation of the noise variance along the AMALA-SAEM iterations. Left: original data. 
Right: noisy data. 
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Fig 7. Empirical convergence toward the Gaussian distribution of the estimated noise variance along the AMALA- 
SAEM iterations for digit 0. Left: original data. Right: noisy data. 
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Fig 8. Empirical convergence toward the Gaussian distribution of the estimated noise variance along the AMALA- 
SAEM iterations for digit 2. Left: original data. Right: noisy data. 
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5.5. Classification results 

The dcformable template model enables to perform classification using the maximum likelihood 
of a new image to allocate it to one class, here the digit. We use the test USPS database (which 
contains 2007 digits) for classification while the training was done on the previous 20 noisy images. 
The results obtained with the hybrid Gibbs, MALA and AMALA-SAEM are presented in Table 1. 
In dimension 72, the best classification rate is performed by the hybrid Gibbs-SAEM. This is 
easily understandable since the sampling scheme enables to catch deformations which have been 
optimized control point by control point. Therefore, the estimated covariance matrix carries more 
local accuracy. The AMALA-SAEM leading in a much smaller computation time to estimates 
of the same quality provides also a very good classification rate. This confirms the good results 
observed on both the template estimates and the synthetic samples. Unfortunately, the MALA- 
SAEM shows again some limitations. Even if the templates look acceptable, the sampler does not 
manage to capture the whole class variability. Therefore, the classification rate falls down. 

In order to evaluate the stability of our estimation algorithm with respect to the dimension, 
we perform the same classification with more control points. As expected, the MALA-SAEM 
classification rate is deteriorated whereas our AMALA-SAEM keeps very good performances. 
Note that the hybrid Gibbs sampler was not tested in dimension 2k g — 128 because of its very 
long computational time. 



Sampler / 








Dim of Deformation 


Hybrid Gibbs 


MALA 


AMALA 


72 


22.43 


35.98 


23.22 


128 


X 


43.8 


25.36 



Table 1 

Error rate using the estimations on the noisy training set with respect to the sampler used in the MCMC-SAEM 
algorithm and the dimension of the deformation 2k g . The classification is performed on the test set of the USPS 

database. 



5.6. Medical image template estimation 

A second database is used to illustrate our algorithm. As before, in order to make comparisons 
with existing algorithms, we use the same database presented by Allassonniere et al (2010a). It 
consists of 47 medical images, each of them is a 2D square zone around the end point of the corpus 
callosum. This box contains a part of this corpus callosum as well as a part of the cerebellum. Ten 
exemplars are presented in the top rows of Fig. 9. 

The estimations are compared with these obtained with the FAM-EM and the hybrid Gibbs- 
SAEM algorithms and with the mean image (bottom row of Fig. 9). In this real situation, the 
Euclidean mean image (a) is very blurry. The estimated template using the FAM-EM (b) provides 
a first amelioration in particular leading to a sharper corpus callosum. However, the cerebellum 
still looks blurry in particular when comparing it to the shape which appears in the template 
estimated using the hybrid Gibbs SAEM (c). The result of our AMALA-SAEM is given in image 
(d). This template is very close to (c) as we could expect at a convergence point. Nevertheless the 
AMALA-SAEM has much lower computational time than the hybrid Gibbs-SAEM. This shows 
the advantage of using AMALA-SAEM in real cases of high dimension. 

5.7. 3D medical images 

We also test our algorithm in much higher dimension using the dataset of murine dendrite spines 
(see Aldridgc et al (2005); Ceyhan et al (2007a,b)) already used by Allassonniere et al (2010a). The 
dataset consists of 50 binary images of microscopic structures, tiny protuberances found on many 
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Fig 9. Medical image template estimation. Top rows : 10 Corpus callosum and cerebellum training images among 
the 47 available. Bottom row : (a) mean image, (b) FAM-EM estimated template, (c) Hybrid Gibbs - SAEM 
estimated template, (d) AMALA-SAEM estimated template. 

types of neurons termed dendrite spines. The images are from control mice and knockout mice 
which have been genetically modified to mimic human neurological pathologies like Parkinson's 
disease. The acquisition process consisted of electron microscopy after injection of Lucifer yellow 
and subsequent photo-oxidation. The shapes were then manually segmented on the tomographic 
reconstruction of the neurons. Some of these binary images are presented in Fig. 10 which shows 
a 3D view of some exemplars among the training set. Each image is a binary (background = 0, 
object = 2) cubic volume of size 28 3 . We can notice here the large geometrical variability of this 
population of images. Therefore we use a hidden variable of dimension 3fc g = 648 to catch this 
complex structure. 

The template estimated with either 30 or 50 observations are presented in Fig. 12. We obtain 
similar shapes which are coherent with what a mean shape could be regarding the training sam- 
ple. To evaluate the estimated geometrical variability, we generate synthetic samples as done in 
Subsection 5.3. Eight of these are shown in Fig. 11. We observe different twistings which are all 
coherent with the shapes observed in the dataset. Note that the training shapes have very irregular 
boundaries whereas the parametric model used for the template leads to a smoother image. Thus, 
the synthetic samples do not reflect the local ruggedness of the segmented murine dendrite spines. 
If the aim was to capture these local bumps, the number of photometrical control points has to 
be increased. However, the goal of our study was to detect global shape deformations. 

6. Conclusion 

In this paper we have considered the deformable template estimation issue using the BME model. 
We were particularly interested in the high dimensional setting. To that purpose, we have proposed 
to optimize the sampling scheme in the MCMC-SAEM algorithm to get an efficient and accurate 
estimation process. We have exhibited a new MCMC method based on the classical Metropolis 
Adjusted Langevin Algorithm where we introduced an anisotropic covariance matrix in the pro- 
posal. This optimization takes into account the anisotropy of the target distribution. We proved 
that the generated Markov chain is geometrically ergodic uniformly on any compact set. We have 
also proved the almost sure convergence of the sequence of parameters generated by the estimation 
algorithm as well as its asymptotic normality. We have illustrated this estimation algorithm in 
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Fig 10. 3D views of eight samples of the data set of dendrite spines. Each image is a volume leading to a binary 
image. 




Fig 11. 3D views of eight synthetic data. The estimated template shown in Fig. 12 is randomly deformed with 
respect to the estimated covariance matrix. 

the BME model. We considered different datasets of the literature namely the USPS database, 
2D medical images of corpus callosum and 3D medical images of murine dendrite excrescences. 
We have compared the results with previously published ones to highlight the gain in speed and 
accuracy of the proposed algorithm. 

We emphasize that the proposed estimation scheme can be applied in a wide range of application 
fields involving missing data models in high dimensional setting. In particular, this method is 
promising when considering mixture models as by Allassonniere and Kuhn (2010). Indeed, it will 
enable to shorten the computation time of the simulation part which in that case requires the 
use of many auxiliary Markov chains. This also provides a good tool for this BME model when 
introducing a diffeomorphic constrain on the deformations. In this case, it is even more important 
to get an efficient estimation process since the computational cost of diffeomorphic deformation is 




Fig 12. Estimated templates of murine dendrite spines. The training set is either composed of 30 (left) or 50 
(right) images. 
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intrinsically large. 
7. Appendix 

7. 1 . Proof of Proposition 1 

The idea of the proof is the same as the one of the geometric ergodicity of the random walk 
Metropolis algorithm developed by Jarner and Hansen (1998) and re-written by Atchade (2006) 
for its adaptive version of the MALA with truncated drift. The fact that both the drift and the 
covariance matrix are bounded even depending on the gradient of log tt enables some similar proofs. 
Let us first recall the transition kernel: 

IL(x,A) = / a(x, z)q c (x, z)dz + 1~a{%) I (1 — ct(x, z))q c (x, z)dz , (7-1) 
J A Jx 

where a[x,z) = min(l, p(x, z)) and p(x,z) = q^ q °j*f x ] ■ 

Thanks to the bounded drift and covariance matrix, we can bound the proposal distribution q c 
by two centered Gaussian distributions with covariance matrices e\Idi and t^ldi as follows: there 
exist two constants < k\ < k% such that for all (x, z) £ X 2 

k\g ei (x - z) < q c (x, z) < k 2 g t2 (x - z) , (7.2) 

denoting by g a the centered Gaussian probability density function in M. with covariance matrix 
aldi. 



7.1.1. Proof of the existence of a small set C 

Let K — B(0, R) be the ball in M. d centered at and of radius R > 0. For any x £ X and for any 
A£B: 



H(x,A)> / a(x, z)q c (x, z)dz . 
J AnK 

Let C be a compact subset of X and 

t = min p(x, z). Since p is a ratio of positive continuous functions in both variables, then 

xGC, z£K 

t > 0. Therefore, for all x £ C and for any A £ B: 

H(x, A) > min(l, r) / q c (x,z)dz. 

J AnK 

Moreover, thanks to (7.2), 

H(x, A) > fcimin(l,r) / g ei (z — x)dz 

J AnK 

> fcimin(l,r) / g ei (z)t Kc{A) (z)dz , 
Jx 

where since C is a compact subset of E rf , it is bounded and we can define K C (A) to be the 
intersection of all translations of A n K by any element of C. Note that we can choose R sufficiently 
large with respect to the size of C so that this intersection is not empty. 

Let £ = k\ min(l, t)Z and 
v(A) — -g- J x g ei (z)1x c (A) [z)dz where Z is the renormalisation constant. Thus, C is a small set for 
the transition kernel LI and (4.3) holds. 
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7.1.2. Proof of the drift condition 

As already noticed by Atchade (2006), we only need to prove the two following conditions: 

JIV(x) 



SUP ~T77\ 
xeX V{X) 



< oo 



and 



lira sup — — — - < 1 . 

|a|->oo V{X) 



(7.3) 



(7.4) 



We take the same path as Atchade (2006) applied to our case and refer to Fig 13. for a 2D 
visualization of all the spaces introduced along the proof. 




Cn[X)(U) 



n(x)>TT(y) 



Fig 13. 2D representation of the sets used in the proof. 

For any x € X, we denote A(x) = {z 6 X such that p(x, z) > 1} the acceptance set and R(x) = 
A(x) c the complementary set of A(x). Then, using V(x) = cir(x)~P for any < f3 < 1, 



UV(x) 
~V{x) 



f ^ y ( z )j , f n(z)q c (z,x) 

/ <lc(X,z)-—dz + / r 

)A(x) V{x) J r(x) ir(x)q c (x,z) 

f ( Wo. 

^q c (x,z)dz + 



q c (x, z 



lA(x) k{x) I 3 



ir(z) 1 -! 3 

R(x) n(x) 1 - 



V{z) f ( w(z)q c (z,x)\ 

— -dz + / 1 -— — -\q c (x,z)dz 

V{x) J R{x) V n{x)q c (x,z)J 



On the acceptance set A(x) 



T —^q c {z,x)dz+ / q c (x,z)dz 

Jr(x)' . ' 

f3(x,z) 



f2(x,z) 



n(x) 



c (x,z) < q c (z, x) p q c (x,z) 



1-/3 



Thanks to Equation (7.2) one can bound this term by the following symmetric Gaussian distribu- 
tion: 



n(z)- 



jq c (x, z) < k 2 g e2 (z - x) 



which yields: 



A(x) 



n(x)-P 

fi(x,y)dz<k 2 g e2 (z~x)dz. 



A(x) 



(7.5) 
(7.6) 
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Equivalently on R(x), we have the following bound: 



ir(z) 1 -' 3 



i u H Qc(z,x) < q c (x,z) 'q c {z,x) p <k 2 g e2 {z-x). (7.7) 
Let fix e > 0, there exists a > such that J* B , ^ ,g e2 (.z — x)c?z > 1 — e. This leads to: 

/ f 1 (x,z)dz < k 2 £ 

J A(x)nB(x,a) c 

Let C w ( x ) be the level set of n in x: C^m — {z € <-f : tt(z) = 7r(x)}. We define a pipe around 
this level set as 

C n ( x ){u) = {z + s n(z), \s\ < u, z e CVjx)}- 

Thanks to assumption (Bl), there exists n > such that for all x £ X satisfying \x\ > r\ then 
is inside the hyperspace defined by the level set C^t x ) (7r(0) > k{x)). Therefore, let x G X, \x\ > r\, 
then for all z € X , Etei G C^x) such that z = X\ + s n{x\). 

Since z i— > (^(z — x) is a smooth density in the variable z, we can find u > sufficiently small 
such that 

/ g £2 (z — x)dz < £ , (7.8) 

J B(x,a)nC wM (u) 

leading to 

/ fi(x,z)dz < k 2 £ . 



A(i)nB(i,a)nC l(l )(«) 

ir{x) 



Assumption (Bl) implies that for any r > and s > 0, d r (s) = sup ^^~7~p^ goes to as 

\x\>r 

r goes to oo. Denote C^^ x )(u) c+ — {z g C' 7r ( 2 ,)(M) c s.i. 7r(x) > 7r(z)} and Cw^m) 0- = {z G 
C 7r ( a; )(w,) c s.t. 7r(x) < 7r(z)}. Therefore there exists r2 > ri + a such that for any x : \x\> r 2 and 
using also the definition of the acceptance set A(x): 

[ fi(x,z)dz < [ ("7~t) q c (z,x)dz 

J A(x)C]B{x,a)nC 7 , {x) (uy+ J /l(i)nB(i,o)nC, (ll (u)<=+ V^W/ 

< d r2 (u) 1 ~^k2 / g e2 ( z ~ x)dz < k2d r2 (u) 1 ~@ , 



using equation (4.2) which states that the stationary distribution is decreasing in the direction of 
the normal of x sufficiently large. 
In the same way, one has: 



/ fi{x,z)dz < f 

JA(i)nB(i,a)nC.[,i(ii) c - JAt 



' A(x)r\B(x,a)r\C n{x) {u)"- J A(x)nB(x,a)r\C„( x )(u) a - \ 7r ( a; ) 

< k 2 d r2 {uf . 

The same inequalities can be obtained for f 2 using the same arguments: 

/ f 2 (x,z)dz < k 2 £ 

J R(x)nB(x,a) c 

f 2 (x,z)dz < k 2 £ 

fi(i)nB(i,a)nC l(l] («) 

f 2 (x 7 z)dz < fc 2 rfr 2 ( M ) 1 ^ 

fl(i)nB(i,a)nC, ((t) 

/ f 2 {x,z)dz < k 2 d r2 (u) p . 

J R(x)nB{x,a)nC lt(x) 



S. Allassonniere et al. /Efficient stochastic EM in high dimension 



25 



This yields 

limsup : 

|x|->-oo v ( x ) |sb] — >-oo JR(x) 



UV(x) 

limsup —limsup / q c (x,z)dz. (7-9) 



Let 



Q(x, A{x)) = J A{x) q c (x, z)dz, we get 



UV(x) 

limsup — — — - = 1 — liminf Q(x, A(x)). 

\x\-HX V\ X ) \x\-*CO 

Let us now prove that liminf Q(x, A(x)) > c > where c does not depend on x. 

\x\^rOO 

Let a fixed as above. Since q c is an exponential function, there exists eg > such that 

inf ^4^ c o- ( 7 - 10 ) 

z£B(x,a) q c [X, Z) 

Moreover, thanks to assumption (Bl) there exists > such that for all x £ X, \x\ > r%, there 
exists < u 2 < a such that, 

7 , ss <C%. (7.11) 

Hence, for |x| > any point xi = x — ui n(x) belongs to A{x). 
Let W(x) be the cone defined as: 

W(x) = [x 2 - sC, < s < a - u 2 , C G S d -\ |C - n(x 2 )\ < |} (7.12) 

where 6> d_1 is the unit sphere in R d . 

Let us prove that W(x) C ^4(x). 

Using assumption (Bl), we have for a sufficiently large x: m(x).n(x) < —e. Besides, by con- 
struction of W(x) for large x, for all z £ W(x), \n(z) — n(x)\ < e/2 with n(x) = n{x2) (see Fig 
13). This leads to for any sufficiently large x, for all z £ W(x), 

m(z).( = m(z).(C - n(x 2 )) + m(z).{n{x 2 ) - n(z j) + m(z).n(z) < e/2 + e/2 - e = . (7.13) 

Let now z = x 2 — s£ £ W(x). Using the mean value theorem on the differentiable function n 
between x 2 and z, we get that there exists r €]0, s[ such that ir(z) — tt(x 2 ) — — sQSjTt(x 2 — tQ. 
Using the definition of m, this implies that ir(z) — ir(x 2 ) — —s(.m(x 2 — rC)|V7r(a;2 — t()\ > 
thanks to Equation (7.13). Putting all these results together we finally get that for all z £ W(x), 
7r(z) > ir(x 2 ) > \tt(x). Moreover, as z £ B(x,a) as well, Equation (7.10) is satisfied, leading to 
z £ A(x). 

Then, we have 

Q(x,A(x)) = / q(x,z)dz 

JA(x) 



> / kig ei (z — x)dz 

J A{x) 



> ki g ei (z-x)dz 

'w(x) 



9e 1 {z)dz 

T«(W(x)) 



where 

T 



(W(x)) = [-u 2 n{x) - sC, 0<s<a-u 2 , (£S d -\ \t - n(x)\ < ^} (7.14) 
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is the translation of the set W[x) by the vector x. But since g ei is isotropic and T x (W(x)) only 
depends on a fixed constant 112 and n(x), this last integral is independent of x, so there exists a 
positive constant c such that: 

tl (z)dz. (7.15) 

JT x {W(x)) 

Back to our limit, 



JT m {W{x)) 

UV(r) 

limsup-^ < 1-c (7.16) 

\x\^oo V(X) 

which ends the proof of the condition (7.4). 

To prove (7.3), we use the previous result. Indeed, since ^p^y is a smooth function on X it 
is bounded on every compact subset. Moreover since the limsup is finite, then it is also bounded 
outside a fixed compact. This proves the results. 



7.2. Proof of Theorem 1 

We provide here the proof of the convergence of the stochastic approximation sequence. Proposition 
1 ensures the existence of a small set for each transition probability IT which satisfies also a drift 
condition. Therefore, each of this transition probability generates a uniform ergodic chain which 
admits an unique stationary distribution (see Meyn and Tweedie (1993)). This result combined 
with assumption (A2') implies assumption (A2) of Allassonniere et al (2010b). 

We now prove assumption (A3'). Let K. be a compact subset of S. The proof of the existence 
of a small set for each transition probability detailed in Section 7.1.1 can be easily adapted to 
construct a common small set C for all transition probabilities obtained for all s G K. where K. is 
compact. 

Moreover, each transition probability satisfies a drift condition: 

Vs G AC, Vz G X U s V s (z) < X s V s (z) + b s t c (z), (7.17) 

where A s G]0, 1[, b s > and V s (z) = c l ^nj l3 (z) with c s = maxn s (z). Note that assumption (Bl) 

ensures that < c s < 00. 

We now have to prove that there exist a function V, constants meN*, < A m < 1 and b > 
independent of s such that 

Vz g x sup n™y(z) < x m v(z) + bi c (z). 

seK 

Thanks to assumption (B2), there exists a constant eg uniform in s with respect to the compact 
K such that Equations (7.10) and (7.11) still hold for all s G K. This implies that the set T x (W(x)) 
is independent of s G /C. Therefore, we can set A — 1 — c < 1 where c is defined in Equation (7.15) 
and is also independent of any s in the compact K,. The same arguments hold for b. 

Let si € S be fixed arbitrarily and define V(z) = c% TTjf(z). Since K. is compact, there exist 
two constants (01,02) such that Vs G K Vz G X ciir Sl (z) < ir s (z) < C2TT Sl (z). 

Then, we get 

uTv(z) < 4 x 4c-en™v s {z) 

< cP ai 4c-P(X m V 8 {z) + 6/(1- A)) 

< c^4x m V(z) + 4 i 4c^b/(l-\) 

< (c 2 /c i yX m V(z) + 4 1 4c^b/(l-X) 

where c/c = min sS yc c s > 0. Choosing m large enough such that A m = (c 2 /ci) /3 A m < 1 ensures the 
result. 
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The geometric ergodicity of the Markov chain also ensures the existence of a solution of the 
Poisson equation (see Meyn and Tweedie (1993)). Thus, assumption (A3") implies assumption 
(A3') in the context of Anisotropic Metropolis Adjusted Langevin transition probability and the 
stated convergence result holds. 

7.3. Proof of Theorem 2 

We provide here the proof of the convergence of the estimated sequence generated by Algorithm 
1. 

We apply Theorem 1 with the functions H s equals to H s (z) — S(z) — s, tt s = P§/ s \ and 

H s ) = J (S{z) - s)p §{s) {z)fi{dz) . 

Let us first prove that assumption (Al') is satisfied. Assumptions (M1)-(M6) ensure that S 
is an open subset and that the function h is continuous on S. Moreover defining w(s) = —l(9(s)), 
we get that w is continuously differentiable on S. Applying Lemma 2 of Delyon ct al (1999), we 
get (Al')(i), (Al')(iii) and (Al')(iv). 

To prove (Al')(ii), we consider as absorbing set <S Q the closure of the convex hull of S(M. 1 ) 
denoted Conv(S(R 1 )). So assumption (M7)(ii) is exactly equivalent to assumption (Al')(ii). 

This achieves the proof of assumption (Al'). 

Assumption (A2') is directly implied by assumption(Ml). 
Let us now prove assumption (A3"). 

Let si £ S be fixed arbitrarily and define V(z) = c^pT* 9 Az), with c Sl = max z ^x P§( S1 )( Z )- Let 
also K be a compact subset of S. 

We first consider condition (A3"(i)). 

Since H s (z) = S(z) — s, assumptions (M8) and (Bl) ensure that there exists a constant C > 
such that sup sup F ^ < C and inequality (4.8) of (A3"(i)) holds. 

sG/C z£X P 8( S1 ) (Z ' 

The uniform ergodicity of the family of Markov chains corresponding to the AMALA on K. ensures 
that there exist constants < 7jc < 1 and Cic > such that for all s £ JC 



8up||fl5 W llv = 




< supJ2 C K-1k\\ h s\\v < oo . 

seK k>o 

Thus Vs £ /C, g§,. belongs to C v = {g : R l -> M, \\g\\ v < oo}. 

Repeating the same calculation as above, it is immediate that sup |||n^ s Nf?g( s J| is bounded. This 

s£fC 

ends the proof of inequality (4.9) of (A3"(i)). 

We now move to the Holder condition (4.11) of (A3"(i)). We will use the following lemmas 
which state Holder conditions on the transition kernel and its iterates: 

Lemma 1. Let K, be a compact subset of S . There exists a constant Ck. such that for all 1 < p < q, 
for all function f £ Zy v and for all (s, s') £ K- 2 we have : 

lin^Z-n^/Hv-^Cjcll/llv* h-s'\\. 
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Proof. For any / G £yp, we have 

IW(*)= / /(z>(z^% c (z,z^)dz' + /(z)(l- a (z,tf)) , 

where a(z,z',8) = min (l, ^0^^ ) and a(z,0) = / a(z, z', 6>)g c (^, z'; <9)dz' is the average 
acceptance rate. 

Let s and s' be two points in JC and s(e) = (1 — e)s + es' for e e [0, 1] be a linear interpolation 
between s and s' (since 5 is convex, we can assume that JC is a convex set so that s(e) £ JC for 
any e <G [0, 1]). We denote also by 9(e) = 0(s(e)) the associated path in 9 which is a continuously 
differcntiable function. 

We introduce TL\f(z) = (1 - a(z,6>))/(z) and ng/(s) = / R „ f(z')a(z, z' ,9)q c (z, z' ,9)dz' . We 
start with the difference \(Hgtuf — ^-0(o)f)( z )\- 



We have: 



l(I^(i)/ -I^ ( o)/)WI< / / !/(*')! * 



o Jr' 



tie 



(a(z,z',0(e)) gc (z,z';0(e))) 



dz'efe . 



Since a(z, z' , 9) = min(p(z, z', 9), 1) where p(z, z' , 9) - pe(z)qc{z z ,. 
we have almost every where: 



' n — p° (z '^ (z ' '-^ is a smooth function in 0, 



de 



(a(z,z',9(e))q c (z,z';9(e))) 



< q c (z,z';9(e))x 



de 



(log 71-0(^(2') - log7r 0(e) (z)+ log q c (z,z'; 9(e))) 



< C/cq c (z, z'; 0( e ))||— V z log7r e(e) (z) 

< C K q c (z,z';e(e))R(z,z'), 



1 + \\S(z') S(z)\\\\ - m e))\\+\\z' z 5D(z, 9(e))\\ 2 



where R is a polynomial function of both variables z and z' , with degree 2 in z' (thanks to 
assumption (M8) and (B3)). 

Since \f(z')\ < \\f\\ V pV p (z'), there exists an C K such that for all (s,s') e JC 2 , for all z € R', 

/" / 1/(^)1 g c («, ^; «(e))J2(^, a/)^^ < Cjc||/||vp / / Vr(z')R(z,z')q c (z, z'; #(e))dz'de . 

jR 1 Jo JR ! 



Thanks to assumption (M8) and the normality of q c , we have for all < j3 < 1 and 1 < p 
satisfying < /3p < 1 : 

sup / y p (z')i?(z, z')q c (z, z'; 0(s))dz' < Oei?(z) , 

where R is a polynomial function. 
Therefore we end up with: 

\(Ul {1) f-Ul {Q) f)(z)\ < C K \\f\\ v 4s-s'\\R(z). 

Moreover, thanks to the super exponential property (Bl) of p$, we have that for all q > : 

R(z) 



sup 



< CO . 



This implies that: for all / e 



(n| w /-n| (8f) /)|| v « < c^H/llvplla-*'!!. 
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Now, looking at the first term, 

l(nS (1) /-nS ( o)/)(*)l = Hz,9(l))-a(z,9(0))\\m\ 
= \(U 2 eil) l-U 2 m l)(z)\\f(z)\ 

< C K \\l\\ v ,\\a-tJ\\R(z)\f(z)\, 

where we define 1 to be the constant function equals to 1 everywhere. 
Following the same lines as previously, we get 

ml { i)f-^mf)^)\<Cic\\\\s'-s\\ R(z)\\f\\ v ,V*(z). (7.18) 
Since R is polynomial, we obtain for all q > p and for all / € Cyp, 

ml^f-Ul^mv, < Cfc||/||v,||a-*'||. 
Thus, adding both terms, we get (updating again C/c) that 

||(n e(1) - n e(0) )/|| y , < CfcH/IMK - «|| , (7.19) 
which ends the proof of the lemma. □ 



Lemma 2. Let K be a compact subset of S . There exists a constant such that for all 1 < p < q, 
for all function f € Lyp , for all (s, s') £ K- 2 and for all k > 0, we have: 

H^f-^^fWv <C,c\\f\\v4s-s'\\ . 

Proof. We use the usual decomposition of the difference denoting 8 = 8(s) and 9' = 6(s') : 

fc-i 

nS/ - Kf = E n «( n « - n^jen*-*- 1 / -Mf)) ■ 

i=l 

Using Lemma 1, the fact that ||IIg(/ — Pe{f))\\vp < CjcT/cII/IIvp with 7^ < 1 (geometric 

ergodicity) and sup sup ||IfL < 00 we get: 

j>0 s£K &{ - s > 



fc-i 

\U k e f-U k e ,f\\ vq < CcY.W^o-Ug^xiU^f-pe'UMv^ 



k-1 

< C,c\\f\\ v 4s-s'\\J27 k K i+l 

and the lemma is proved. □ 

Thanks to the proofs of Allassonniere et al (2010b), we get that h is a Holder function for any 
< a < 1 which leads to (A3"(i)). 

We finally focus on the proof of (A3"(ii)). 

Lemma 3. Let K be a compact subset of S and p > 1. There exists Ck. > such that for any 
s, s' £ fC, for any z G X , 
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Proof. We calculate first the differential in s of the function Pjfy We get for all z € X: 
and 

\d sP ^(z)\ < P P P^jWld, log(p 5w (z))| . 

We have for all z e X: 

exp U(9(s)) + (S(z),m s )))) 
8{S) /exp (^(,s)) + (S(z), #?>)))) dz' 

which leads to 

d s \0g( Pd(s) (z)) = dMK*)) + (S{z),d.<f f (§(8))) 

fexp(i>(d(s)) + (S(z),<f>(d(s))) 



/exp (i>0(a)) + (S(z),tf>0(a))) 

(d a ll>0(B)) + {S(z),d.4>(§{8))) 



dz 

dz 



Jexp(jP(e(s)) + (S(z),<f>(6(s))) 



dz 



(7.21) 



We can deduce from assumptions (M2), (M5) and (M8) that there exists a constant C/c such 
that: for all s G K 

\d^(j> hs) (z))\<C K P(z), 

which implies that 

\d s p7^{z)\<C K pl3p-^{z)P{z). 
Introducing again < c\ < C2 such that 

for any (z.s) G A" x JC we deduce that 



|9 s p7j( 2 )|<4c K p^pr^( 2 )p( 2 ). 



(7.22) 



Using the mean value theorem, we get 
We recall that V^ (s) (z) = cPpgj 3 (z). This yields 

Since c s = maxp^^z) and using the smoothness of the target distribution with respect to both 

the parameter and the hidden variable, we can bound c s on any compact subset JC by Cjq and 
using Equation (7.22) we have: 

We now focus on the estimation of |cf p — Cy P |. 
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Let us note that: 



\4 P -c P /\ = Imaxp^^-maxp^^z) 



We then use the mean value theorem as before: 



and get: 

This yields 
and thus 



\d sP l P (s) (z)\ < C K pi3p^z)P{z) 

HU {z) - p t^ * C K p^\\s-s'\\ P f {si) {z)P{z) 
< C K pp\\s-s'\\4fP{z). 

r 0P _ JP 



\c^-c p s ?\<C K P{z)\\s~s'\\ 

I^W-^WI ^ 0' K \\s-s'\\p^ ){ z)P{z) 

< C'^Ws-s'WV^Piz), 

which concludes the proof of Lemma 3. □ 

Lemma 4. Let K, be a compact subset of S, p > 1 and P be a polynomial function on X. There 
exists [i^. > such that for any z € X , for any sgK, we have 

II «(.)%) P W^^% ) W. (7.24) 

Proof. This proof is similar to the proof of Proposition 1 and in particular of Equation (7.3). The 
key property leads in an equivalent inequalities sequence given in Equation (7.2) including the 
polynomial function P. As before, we have : thanks to the bounded drift and covariance matrix, 
there exist two constants < fci < ki such that for all (x, z) € X 2 

kig ei (x - z) < q c (x, z)P(z) < k 2 g t2 (x - z) , (7-25) 

denoting by g a the centered Gaussian probability density function in M. 1 with covariance matrix 
aldi. Therefore we get, with the condition that /3p < 1 : 

m ,[l] - / 0P -q c (x,z)P{z)dz+ f ^—^-q c {z,x)P(z)dz+f q c (x, z)P(z) dz . 



R(a 



h(x,y)P(z) f 2 (x,z)P(z) 



h(x,z)P(z) 



Equation (7.25) enables to bound the two first elements as already done and thus they converge 
toward as \x\ — > oo. Considering now the last term, we can use the same bound : 

q c (x, z)P{z)dz < / fc 2 g £2 (a; — z)dz < fc 2 . 
R(x) Jr(x) 

Then, taking the limsup over |a;| tending to infinity provides the result. 

□ 

Lemma 5. Let K, be a compact subset of S and p > 1 such that p(3 < 1. For all sequences 
1 = {lk)k>o and e = {£k)k>o satisfying < I for some e sufficiently small, there exists C/c > 0, 
such that for any z £ X , we have 

supsu P E7J^(z fc )l CT(yc)Al/(e) > fc ] < C K W(z). 

s£K k>0 
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(7.26) 
(7.27) 



Proof. Let K be a compact subset of such that 6(JC) C K. We note in the sequel, Ok = 0{sk) 
We have for k > 2, using the Markov property and the drift property (7.17) for Vg, 

< AEJ j .[(V^_ 1 (2! fc _ 1 )l tr( x : )AK.)>*] + c 

Moreover, with the same arguments and Lemma 3, 

EIJ^Li^- 1 ) 1 ^^^)^] = E l,s[( V 9 k . 1 ( Z k--L) ~ V e P fe 2 (z fc _i))l (T(K;)Al/(£) > fe ] 

+ E lJ^f fc _ 1 ( 2: fe-i)l<T(K;)A 1 ,( £ )>fe] 

< C x: e fc _iEJ i ,[II flit _ a vy fc _ il P(« fc _a)l ff( jc}Av(.)>fc] 
+ E 7, s [ n e fc -2 1/ e P fc _ 2 :,1 ff(K;)A I .( e )>fc] 

< C^Efc-l fJ>KMl,s We k - 2 ( z k-2)^a(lC)Au(e)>k] 
+^ls[ V l- 2 ( Z ^)MlC)^(e)>k} + C 

< C + A(A + C^-iMc) x EJ ]S [V^_ 2 (^_ 2 )1 
Iterating this recursively leads to : 

fc-ife-i 

i+x]n> +c ^MJc) 

fc-i 
_j'=i 

which can be injected in Equation (7.26) and yields : 

fe-i 



(7.28) 

E?,.Wo(«)l«r(JC)AK«)>*]. ( 7 - 29 ) 



E Z s [ v eI_ 1 ( z fe) ]1 ^(K;)A i ,( £ )>fe] < A 



[J (A + CfcEjUK.) 



3 = 1 



c 



fc-1 fc-1 

J=2 i=j 



Since A < 1, we can choose e such that p = A + C^e/i^; < 1 which ensures that for all k £ N, 
we have : 



fe-2 



< TO 1 + 



C 



1-P 



definitions of V and Vg we get the result. 

This yields (A3"(ii)) under (A4) and allows finally to apply Theorem 1. 



Using < C\ < c 2 such that Cip~f .(z) < pa (z) < c 2 p-, Az) for any (z,0) £ M 1 x K and the 

6>(si) " 9(si) 

□ 



7-4- Proof of the Central Limit Theorem for the Estimated Sequence 

To prove this result we consider Theorem 24 of Delyon (2000): Consider the stochastic approxi- 
mation with a noise of the form: 

s k = s k -i + 7fc/i(sfc_i) + 7fc6c + v k - Vk-i + r k (7.30) 
and assumption (C) given by 
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(C) The algorithm defined by Equation (7.30) is A-stable, the sequence (sfc) converges toward 
some limit s* . The function h is C 1 in some neighborhood of s* with first derivatives Lipschitz 
and J the Jacobean matrix of the mean field h in s* has all its eigenvalues with negative 
real part. 

Theorem 4 (Theorem 24 Delyon (2000)). Let assumptions (C) and (N3) be satisfied, and if 
lim kjk — 1 we require that 7/ t = 1/k. Furthermore assume that for some matrix U, some e > 

k— ¥ OO 

and some positive random variables X, X 1 , X" : 

The sequence is a J 7 — martingale (7-31) 
sup||&|| 2 + e < 00 (7.32) 

lim 7fe 1/2 ||^l|i -0 (7.33) 

k—¥oo 

lim 7 y 2 ||XV||i -0 (7.34) 

k— >-oo 
fc 

lim 7fc||*"£(&S-E/)lli=0 (7-35) 

i=l 

where J- = (J-i)i<=n is the increasing family of a— algebra generated by the random variables 
(s ,zx, —,Zi). Then 

^AA(0,y) (7.36) 



VTfe 

where V is the solution of the following Lyapunov equation depending on the cases of assumption 
(N3): 

Case 1: U + JV + VJ' = 0, 

Case 2: U + (J + 7/2)V + K( J + 7/2)' = 0. 

Unfortunately, we are not able to prove the A-stability required in assumption (C) and to prove 
our convergence we use a relaxed version of this condition. This new assumption is a sufficient 
condition to prove the required result as can be seen when following the proofs of Delyon (2000). 
We therefore define a new assumption (C). 

(C) The sequence Sk converges to some limit s* , the function h is C 1 in some neighborhood of 
s* with first derivatives Lipschitz and J the Jacobean matrix of the mean field h in s* has 
all its eigenvalues with negative real part. 

The result of Theorem 24 still holds under this alleviated assumption. Indeed, it is sufficient to 
establish that the random variable 

-1/2 k ... * 

7 fe ^2 ex P((^fe — U)J)liTi converges toward in probability where tj = y] jj. Theorem 19 and 

i=0 j=l 

Proposition 39 of Delyon (2000) can be applied in expectation. Theorems 23 and 20 of Delyon 
(2000) also still hold with assumption (C) instead of (C). 

We have now to prove that the assumptions of Theorem 24 of Delyon (2000) replacing (C) by 
(C). 

We decompose the remainder term as follows: 

Vk = £,k + vk- Vk-i + r k (7.37) 

with 

6 = g Sk -Azk) -~n Sk - 1 g Sk - 1 ( z k-i) (7.38) 

v k = -U Sk g Sk (z k ) (7.39) 
r k = Tls h 98 h (zk)-'n.s k - 1 9s h - 1 (zk) (7.40) 
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where for any s € S, g s is a solution of the Poisson equation g — Tl s g = H s — p s (H s ). 

Thanks to the assumptions of Theorem 2 which provides the convergence of the sequence (sfe) 
and assumption (Nl) we get (C). 

By definition of £j it is obvious that (7.31) is true. Moreover, the following lemma proves that 
there exists e > such that (7.32) holds with X = 1. 

Lemma 6. For all e > such that /3(2 + e) < 1, the sequence is in L 2+e . 
Proof. We use the convexity of the function x >— > x 2+s . Indeed, we have 

\g Sk .M) - n s ,_ l5sfc _ 1 (^_ 1 )| 2+E < (\g Sh _M)\ + in^^^-i)!) 2 ^ 

where C e = ^+7. 

Using assumptions (M8) and (Bl), we get applying the drift condition: 

E(IN| 2+e |^-i) < C £ (E(\g Sk _ 1 (z k )\ 2+e I Fk-i) + \E(U Sk _ 1 g Sk _ 1 (z k -i)\J : k -i)\ 2+£ ) 
< C nV{z k f +£ + V(zk-i) 2+s \J 7 k-i)) < C (AV a+e (« fc _!) + 1) . 

Finally taking the expectation after induction leads to: 

E(\\e k +e \\)<CV 2 ^(z a )<+oo. 

□ 

Let us now focus on Equation (7.33). Thanks to the Holder property of our kernel and the fact 
that H Sh belongs to Cy'- 



Wrj! = E[\U Sn g 8 Jz n )-U 8n _,g gn _,{z n )\] 

< CE[V q (z n )\s n -s n ^\ a } 

< CE[^ +1 (z„M 

< cr n 

where the last inequality comes from the drift property. Since the Holder property is true for any 
< a < 1, we can choose a > 1/2 which leads to the conclusion. 

To prove Equation (7.34), we note that using the drift condition as in the previous lemma, 
E[|t%|] is uniformly bounded in k. Since the step-size sequence (7^) tends to zero, the result fol- 
lows with X" = 1. 

Equation (7.35) is an assumption of our asymptotic normality theorem and Theorem 24 applies. 
The Delta method enables to get the result on the sequence (8 k ) achieving the proof of our Central 
Limit theorem. 
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